Research Methods 2022

Ethan Milne // Ivey Business School

September 12, 2022

Gameplan:


Why webscrape?


Scraping with APIs


Scraping without APIs

Why Webscrape?

What is Webscraping?

Scraping with APIs

What is an API?

APIs: Costs and Benefits

Twitter Study


library(academictwitteR)

bk_tweets <- get_all_tweets(
  query = "@BurgerKingUK",
  start_tweets = "2021-03-06T00:00:00Z",
  end_tweets = "2021-03-13T00:00:00Z",
  n = 10000000
)


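# Pull each author's tweets for the comparison week, one batch of users at a time.
# Assumes unique_authors is a data frame with a username column; batch_size is illustrative.
batch_size <- 10
batch_starts <- seq(1, nrow(unique_authors), by = batch_size)
past_tweets <- vector("list", length(batch_starts))

for (i in seq_along(batch_starts)) {
  start <- batch_starts[i]
  end <- min(start + batch_size - 1, nrow(unique_authors))
  past_tweets[[i]] <- get_all_tweets(
    users = unique_authors$username[start:end],
    start_tweets = "2021-03-01T00:00:01.000Z",
    end_tweets = "2021-03-08T00:00:01.000Z",
    n = 1000000
  )
}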


Results

Model Summaries

                            Without Present Tweets           With Present Tweets
Characteristic              IRR¹    95% CI¹      p-value     IRR¹    95% CI¹      p-value
NormalQuantity (past)       0.70    0.68, 0.73   <0.001      0.74    0.71, 0.76   <0.001
NormalLikes (past)          0.96    0.93, 0.98   <0.001      0.93    0.91, 0.96   <0.001
OutrageQuantity (past)      1.72    1.64, 1.81   <0.001      1.68    1.61, 1.76   <0.001
OutrageLikes (past)         1.12    1.09, 1.15   <0.001      1.09    1.06, 1.13   <0.001
Followers                   1.00    1.00, 1.00   0.4         1.00    1.00, 1.00   >0.9
TotalTweets                 1.00    1.00, 1.00   <0.001      1.00    1.00, 1.00   <0.001
NormalQuantity (present)                                     4.08    3.95, 4.21   <0.001

¹ IRR = Incidence Rate Ratio, CI = Confidence Interval
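
For readers new to incidence rate ratios, here is a minimal sketch of where such a number comes from. It assumes a Poisson regression with a log link, fitted to simulated data. The slides do not say which count model the study used, and the variable names past_outrage and present_outrage are hypothetical. Exponentiating a log-link coefficient gives the IRR, so an IRR of 1.65 would mean each additional unit of the predictor multiplies the expected count by about 1.65.

```r
# Simulated data only; this is not the study's dataset or necessarily its model.
set.seed(42)
n <- 5000
past_outrage <- rpois(n, lambda = 1)  # hypothetical predictor: past outrage tweets

# The true log-rate effect is 0.5 per past tweet, so the true IRR is exp(0.5), about 1.65
present_outrage <- rpois(n, lambda = exp(-1 + 0.5 * past_outrage))

# Poisson regression with a log link
fit <- glm(present_outrage ~ past_outrage, family = poisson(link = "log"))

# Exponentiate coefficients and Wald interval bounds to get IRRs with 95% CIs
irr <- exp(coef(fit))
ci <- exp(confint.default(fit))
round(cbind(IRR = irr, ci), 2)
```

Read this way, an IRR below 1 (such as NormalQuantity (past) in the table) indicates a lower expected count as the predictor increases, and an IRR above 1 a higher one.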

Scraping Without APIs

Building scrapers is hard


No prebuilt code


Websites not designed for scraping


Websites protected against scraping
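
To make the point concrete, a hand-built scraper usually comes down to downloading a page's HTML and pulling out elements by CSS selector. The sketch below uses the rvest package on an inline HTML snippet so that it runs offline. The markup, class names, and column names are hypothetical stand-ins; a real site needs its own selectors, found by inspecting the page.

```r
library(rvest)

# Hypothetical page markup; a real scraper would call read_html() on a URL instead
page <- minimal_html('
  <ul>
    <li class="story"><a href="/works/1">First story</a><span class="words">1,200</span></li>
    <li class="story"><a href="/works/2">Second story</a><span class="words">3,450</span></li>
  </ul>
')

# Select each story node, then extract fields relative to that node
nodes <- html_elements(page, "li.story")
stories <- data.frame(
  title = html_text2(html_element(nodes, "a")),
  url   = html_attr(html_element(nodes, "a"), "href"),
  words = as.numeric(gsub(",", "", html_text2(html_element(nodes, "span.words"))))
)
stories
```

Because the selectors are tied to the site's markup, a scraper like this breaks whenever the site changes its layout, which is part of why building one is hard.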

Building scrapers is rewarding


You know your data inside and out


Your data is unique


Your scraper is a contribution

Fanfiction Study


References